Bacterial natural products (NPs) and their analogs constitute more than half of the new small
molecule drugs developed over the past few decades. Despite this success, interest in natural
products from major pharmaceutical companies has decreased even as genomics has uncovered
the large number of biosynthetic gene clusters (BGCs) that encode novel natural products. To
date, there is still a lack of universal strategies and enabling technologies to discover natural
products at scale and speed. This review highlights several of the opportunities provided by
genome sequencing and bioinformatics, challenges associated with translating genomes into
natural products, and examples of successful strain prioritization and BGC activation strategies
that have been used in the genomic era for natural product discovery from cultivable bacteria.

Role of NPs in Drug Discovery 

NPs have a proven track record of success in the history of drug discovery and development, and they 
continue to play a significant role [1,2]. Among all approved small molecule drugs worldwide from 
1981 to 2014, more than 50% of them are of NP origin or contain an NP pharmacophore [1]. This 
contribution is even more remarkable considering that many major pharmaceutical companies 
have switched their drug discovery programs from NPs to synthetic combinatorial libraries in the 
past 20-30 years. This shift has been primarily due to a false perception of diminishing numbers of
novel scaffolds and a high rate of rediscovery of NPs yielded by traditional strategies [3,4]. However,
the advent of low-cost and rapid genome sequencing provides an informed method for targeted NP
discovery, whether for structure, novelty, or function, thereby promising to eliminate these common
pitfalls of the pregenomic era. Specifically, genome sequencing of bacteria, prolific sources of
pharmaceutically relevant NPs, has revealed that known NPs represent just the tip of the iceberg relative
to the vast diversity encoded within bacterial genomes [5], highlighting how many more NPs have
never been pursued due to low or nonexistent production titers. Encouraged by this advancement
and the new challenges it presents, innovative strategies are continuously being developed, and novel
NPs have been discovered by mining bacterial genomic information. This review highlights the
opportunities brought on by genome sequencing and advanced bioinformatics tools, the challenges
that we still face, and some of the current strategies being utilized in the genome-directed discovery
of new NPs from cultivable bacteria.


Opportunities of the Genomic Era 

Even by the most modest models, the worldwide annual sequencing capacity is predicted to reach
exa-base pairs (10^18 base pairs) by the early 2020s [6]. Although much of this capacity is dedicated
to sequencing human genomes, at approximately 10^7 bases, bacterial genomes are a small fraction
of the size and can be sequenced at far greater rates. Indeed, as of May 2019, public sequencing data
(from the NCBI database^i) exist for more than 211 000 bacteria, providing rich genomic diversity
(Figure 1A) [7]. Several thousand more genomes are represented in metagenomic (see Glossary)
datasets [7]. Of the 211 705 sequenced bacterial genomes, most represent human pathogens from the
Firmicutes and Gammaproteobacteria phyla. By contrast, less than 10% of publicly available
sequenced bacterial genomes belong to the prolific NP-producing Actinobacteria phylum. While
pathogen genomes are undoubtedly important for studying human health, the much smaller
percentage of Actinobacteria demonstrates our opportunity for targeted genome sequencing of
privileged bacteria for the purpose of discovering novel NPs. As our sequencing capacity increases,
the cost decreases, recently falling under US$1000 for a human genome and significantly less than
that for a bacterial genome, enabling any laboratory to take advantage of the genomic era to
revolutionize NP discovery [8]. Bioinformatics studies have demonstrated that within a single bacterial
genome, there can be upwards of 30 biosynthetic gene clusters (BGCs), with most encoding
unknown NPs [9]. Indeed, a survey by the Joint Genome Institute (JGI) has found that less than 0.25% of
identified BGCs have been experimentally correlated to known NPs [10]. For example, in
Streptomyces avermitilis, one of the most well-studied bacterial strains, 23 of the total predicted 40 BGCs
are silent under explored culturing conditions (Figure 1B) [11]. Combined with estimates of the total
number of bacterial species ranging from billions to one trillion [12,13], the field is presented with a
prime opportunity for NP discovery. However, unlike S. avermitilis and numerous other
Actinobacteria, many of the currently sequenced bacteria have been sequenced based on pathogenicity rather
than for their biosynthetic potential, further highlighting the presented opportunity for targeted BGC
discovery. We next highlight the challenges facing the community in taking full advantage of these
opportunities.

Figure 1. Untapped Potential of Bacterial Genomes.

(A) The diversity of sequenced genomes in the NCBI database sorted by phyla. (B) The genome of the model
organism Streptomyces avermitilis is depicted with the locations of 40 putative BGCs indicated. Gray arrows (23)
designate orphan BGCs, while blue arrows (17) link a BGC with the structure of its NP. Abbreviations: BGC,
biosynthetic gene cluster; NP, natural product.

Glossary

antiSMASH: a bioinformatics tool that identifies and annotates secondary metabolite BGCs from bacterial and fungal sequences.
Biosynthetic gene cluster: physically clustered set of genes that together encode the proteins responsible for the biosynthesis of a NP. This genetic organization is far more common in bacteria than in most eukaryotic genomes.
BiG-SCAPE: a bioinformatics tool used to group BGCs into sequence similarity networks for exploration and classification.
Combinatorial biosynthesis: the exploitation of biosynthetic pathways using combinatorial strategies to produce NPs with altered structures.
Core genome: all genes that are conserved throughout strains in a given species.
CRISPR: DNA-editing tool that works by utilizing the specificity of the Cas9 enzyme to cleave DNA at a very specific site as dictated by a synthetic guide RNA.
Degenerate primers: a mix of oligonucleotides with similar sequences that cover all possible nucleotide combinations for a given protein sequence.
Dereplication: the step in natural product discovery used to prevent rediscovery of natural products.
Elicitors: small molecules that trigger the production of a secondary metabolite.
Genome mining: a search for a specific DNA sequence, often associated with a specific gene or BGC. This search may be facilitated by either bioinformatics, for sequenced sources, or PCR, for unsequenced sources.
Genome neighborhood network: a bioinformatics tool used for visualizing the protein families encoded by genes in proximity to the genes analyzed in an SSN.
Hybrid NRPS-PKS: biosynthetic assembly line-like enzymes in which NRPS and PKS modules are both present, leading to a product with both amino acid- and acyl-CoA-derived moieties.
Metabolome: collection of small molecules from a single biological sample, typically a single organism.
Metagenomic: relating to the total DNA from an environmental sample.

Challenges in Genome-Directed NP Discovery 

Opportunities now exist for discovering more NPs than ever before. The key goals of
genome-guided NP discovery are: (i) expansion of NP diversity; (ii) prioritization and prediction of
BGCs; and (iii) rapid production of NPs from silent BGCs. For each goal, however, many challenges
exist that hinder or prevent efficient exploitation of these opportunities. It is unknown how many
permutations exist for bacterial NPs or even how many genomes must be sequenced to achieve a plateau
in BGC discovery. Random selection of BGCs results in an increased number of NPs, but to take full
advantage of NP diversity, BGCs must be prioritized by product novelty and/or function. Once a BGC
of interest is chosen, the odds of it even being expressed under standard conditions by the native
host are poor, although this challenge may be combated somewhat by discovering alternative
producers for well-conserved (and presumably valuable) BGCs. Regardless, new technologies and
defined strategies are required to position ourselves for success in NP discovery in the genomic era.

Various bioinformatics tools are available for identification or classification of BGCs from genomic
information [14], but because these tools were developed based on current biosynthetic knowledge,
novel families of BGCs may be missed or mischaracterized. The prediction of the NP structures 
from genetic information is another critical step for genome mining studies, as this step evaluates 
the structural novelty of NPs and facilitates their dereplication. Generally, accurate predictions can 
be made for the class of NP (e.g., polyketides, peptides, terpenes, etc.) and, in some cases, the 
core structures of the NPs [e.g., products of type I polyketide synthases (PKSs) and nonribosomal 
peptide synthetases (NRPSs)]. However, as opposed to core biosynthetic enzymes, the function of 
most tailoring enzymes or noncollinear enzymes cannot be predicted precisely, so predictions for 
the final NP structures are still far from adequate. An additional challenge for NPs, especially those 
with rare or poorly characterized self-resistance genes, which are arguably the most valuable, is 
the unpredictability of their bioactivities, making function-based prioritization difficult if not
impossible. For silent or low-expressing BGCs, potentially making up 90% of all BGCs, translating the
results of genome mining and BGC prioritization into a NP is not always trivial [15]. Innovative strategies
and generalizable synthetic biology tools need to be developed to access these BGCs and identify 
the encoded NPs. 

Ultimately, while research groups are working to overcome these and other challenges associated 
with genome-guided NP discovery, the development of any universal tools or strategies will require 
advances in several fields, including chemistry, synthetic biology, microbiology, and computational 
biology. Few academic laboratories have the resources to span most of these fields, much less all 
of them at once, and thus, larger, interdisciplinary collaborations across institutes are needed. 

Strategies and Selected Examples for Genome-Based Discovery of Novel NPs 

Several strategies and enabling technologies have been developed to facilitate NP discovery from 
bacteria to address many of the initial challenges highlighted above, though many still remain. Often, 
the best strategy to choose depends primarily on the entry point into the discovery pipeline: whether 
the strain is sequenced or not and how specific the targeted structure or function is. Below, we
highlight some of the most successful and generalizable strategies and examples thereof.

BGC Prioritization from Sequenced and Unsequenced Sources 

While the community has acquired a substantial collection of sequenced bacterial genomes, the full
biosynthetic potential mostly lies in the remaining unsequenced strains, and accessing this additional
diversity is critical to the ongoing success of NP discovery programs. Several successful strategies
have been developed to take advantage of one or both sources, typically focused on structural
novelty (Figure 2, Key Figure). As sequencing data accumulate, the emphasis of BGC prioritization will
necessarily shift from selecting an interesting gene or BGC to the prioritization of which novel
gene or BGC among thousands or millions to select. Mirroring this trend, bioinformatics tools have
improved significantly to the point that it is now relatively simple to group related genes by sequence
[sequence similarity networks (SSNs)]^ii [16], genomic proximity [genome neighborhood networks
(GNNs)]^ii [16,17], and BGC family (BiG-SCAPE)^iii [18], or to predict NP BGCs (antiSMASH^iv or PRISM^v)
[19-21]. These tools are typically utilized in combination with databases of characterized BGCs; for
example, Minimum Information about a Biosynthetic Gene Cluster (MIBiG)^vi [22]. When these tools are
used together, large numbers of diverse sequences can be rapidly sorted and novelty can be identified.
This novelty could result in an entirely new class of NPs or diversity among a family of known clinically
relevant NPs or could be used to identify genetically amenable and/or high-producing alternative
producers of known NPs.

Glossary (continued)

Nonribosomal peptide synthetases: assembly line-like enzymes that can be divided into multidomain modules, with each module responsible for the nonribosomal incorporation and tailoring of a single amino acid into the scaffold of a small molecule product.
Orphan BGCs: BGCs that have not been experimentally correlated with a NP.
Orthologous group: group of genes or proteins with the same function in different species that are related by a common ancestor.
Pangenome: all genes from every strain of a particular species.
Polyketide synthases: assembly line-like enzymes that can be divided into multidomain modules, with each module responsible for the decarboxylative incorporation and tailoring of a single acyl-CoA substrate (or analog) into the scaffold of a small molecule product.
Real-time PCR (or quantitative PCR): technique that couples amplification of targeted DNA with real-time quantification of the amplified DNA through the use of fluorescently labeled reporters.
Sequence similarity network: a bioinformatics tool used for visualizing large sets of sorted protein sequences with different stringency levels.
Silent BGC: BGC that does not produce a NP under tested culture conditions.
Tailoring enzyme: enzyme that decorates a NP scaffold with additional moieties, for example, hydroxyl or methyl groups.
Transformation-associated recombination: technique that takes advantage of yeast's propensity for initiating homologous recombination. This is an especially powerful tool for cloning large BGCs from genomic DNA.

Trends in Pharmacological Sciences, January 2020, Vol. 41, No. 1

Key Figure
Strategies for NP Discovery

Figure 2. The schematic depicts the general strategies that are available to identify and prioritize BGCs from
sequenced (top half) and unsequenced (bottom half) sources. Partial structural predictions are possible for
some enzymes, purely from sequencing data, such as for the depicted hypothetical NRPS in which each module,
represented by a different color, can incorporate a corresponding amino acid into the final structure. A series of
computational tools exist for identifying and comparing individual genes or BGCs (e.g., an SSN) where each
color represents a different family of genes or proteins. From unsequenced strain collections, new candidate
BGCs can be prioritized via real-time PCR screening based on information obtained from sequenced genomes,
such as those from promising NP families (e.g., the enediynes, whose core structure is highlighted in red), or
through resistance gene-guided assays (orange genes) for target prediction. Once an unsequenced strain has
been selected, the accessibility of genome sequencing provides the ability to feed back into the sequenced
databases and identify a specific BGC. Abbreviations: BGC, biosynthetic gene cluster; NP, natural product;
NRPS, nonribosomal peptide synthetase; SSN, sequence similarity network.
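The idea behind grouping related sequences into a sequence similarity network can be illustrated with a minimal Python sketch. It is not how the cited SSN tools are implemented: it assumes short, pre-aligned toy sequences of equal length and uses simple percent identity in place of the alignment scores a real tool would compute. The sequence names and the function names `identity` and `similarity_clusters` are hypothetical.

```python
from itertools import combinations

# Hypothetical, pre-aligned toy protein fragments (not real sequences).
TOY_SEQS = {
    "A1": "MKTAYIAKQR",
    "A2": "MKTAYIAKQK",
    "B1": "MSDLLNPWEE",
    "B2": "MSDLLNPWED",
    "C1": "GGHHTTRRPP",
}


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_clusters(seqs: dict[str, str], threshold: float) -> list[set[str]]:
    """Connected components of the graph whose edges join sequence pairs
    with identity >= threshold (one cluster per putative family)."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))


if __name__ == "__main__":
    # A stringent cutoff separates families; a relaxed one merges them.
    print(similarity_clusters(TOY_SEQS, threshold=0.8))
    print(similarity_clusters(TOY_SEQS, threshold=0.05))
```

Raising or lowering the threshold corresponds to the different stringency levels mentioned in the Glossary entry for sequence similarity networks: stringent cutoffs split sequences into narrow families, while relaxed cutoffs merge distantly related ones.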

Key to discovering novelty is our ability to make predictions for the NP from genetic sequence. Our
ever-increasing knowledge of NP biosynthesis allows the correlation of a structural motif to its
corresponding biosynthetic enzymes or domains. Using this as inspiration, methods have been developed
to quickly scan the genomes of sequenced and unsequenced strains to identify BGCs encoding the
type of enzymes of interest with the potential to produce NPs featuring the desired structural motifs
[23-27]. This is especially true for polyketides [28,29], ribosomal and nonribosomal peptides [21,30],
and sometimes terpenoids [31]. These predictions gain even greater importance within a family of
related BGCs by showcasing natural combinatorial biosynthesis, as for the leinamycin family
described later [23].

Importantly, genome-guided NP discovery is not limited to bacteria with sequenced genomes.
Instead, existing genomic data can be utilized to inform screening campaigns for unsequenced
bacteria, and sequencing any bacteria of interest found in this way in turn strengthens the genetic database
(Figure 2). Targeting unsequenced bacteria is an important strategy as public genomes are not
currently biased towards high-level NP producers. Crucially, there are enough public genomic
data (NCBI database^i) available that degenerate primers can be designed for genome mining of
unsequenced bacteria for NPs bearing a specific chemical moiety, as with the phosphonate family of NPs
highlighted below [27]. These screens often utilize both conventional and real-time PCR (rt-PCR) to
identify BGCs encoding members of a selected NP family, much in the same way a BLAST^vii search of a
sequenced database would work [23,25,32,33]. A further strategy for prioritizing bacteria, sequenced
or unsequenced, is guided by resistance genes, rather than by the biosynthetic genes themselves.
This strategy targets NP functions and is detailed further in a separate section.

Additionally, BGC prioritization does not have to be dependent on currently sequenced bacteria, but
rather, it depends in large part on the choice of bacterial strains to be sequenced in the future. As
stated, less than 10% of bacterial genomes currently in the NCBI database belong to the NP-rich
Actinobacteria (Figure 1A). Thus, by balancing NP-privileged and rare taxa in future sequencing efforts,
more BGC diversity can be sampled than is currently available. This diversity can be a result of
phylogeny (e.g., the rare Streptosporangium versus the common but prolific Streptomyces) or ecology
(e.g., bacteria from a marine sponge versus bacteria from rainforest soil). While it is impossible to
predict exactly which strains will be the best NP producers solely from phylogeny or ecology, by sampling
diversity using both factors, future prioritization will be better informed by a stronger genomic and
BGC database.

Finally, when sequencing bacterial genomes for the purpose of NP discovery, it is critical to consider
the quality of the resultant sequences. Because BGCs can span over 100 kb in length and predictions of
product structure and function depend on accurate sequencing of a large number of genes,
genomes with many errors or large numbers of contigs could be considered detrimental to the discovery
process. However, while a complete genome sequence is ideal, more research is needed on this topic
to determine the minimally acceptable genome quality for NP discovery efforts.
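One way to make the contig concern concrete is a simple contiguity check on a draft assembly. The sketch below is illustrative only: N50 is a common assembly statistic but is not a quality criterion proposed in this review, the 100 kb figure is taken from the text as a nominal BGC length, and the function names and example contig lengths are hypothetical.

```python
def n50(contig_lengths: list[int]) -> int:
    """Length L such that contigs of length >= L contain at least half
    of all assembled bases (the standard N50 definition)."""
    if not contig_lengths:
        raise ValueError("no contigs supplied")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def contigs_spanning(contig_lengths: list[int], bgc_length: int = 100_000) -> int:
    """Number of contigs long enough to hold an intact BGC of the given size."""
    return sum(length >= bgc_length for length in contig_lengths)


if __name__ == "__main__":
    # Hypothetical draft assemblies of roughly the same total size.
    fragmented = [60_000] * 130            # many short contigs
    contiguous = [5_000_000, 2_500_000, 300_000]
    for name, contigs in [("fragmented", fragmented), ("contiguous", contiguous)]:
        print(name, "N50 =", n50(contigs),
              "contigs >= 100 kb:", contigs_spanning(contigs))
```

In the fragmented example no contig is long enough to contain a 100 kb BGC intact, so any such cluster would be split across contigs regardless of how accurate the individual base calls are.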

Discovery of the Leinamycin (LNM) Family of NPs: Scaffold-Directed Genome 
Mining from Sequenced and Unsequenced Strain Collections 

LNM contains a unique 1,3-dioxo-1,2-dithiolane moiety spirofused to an 18-membered macrolactam
ring (Figure 3A) and is a promising anticancer drug lead due to its potent activity and unprecedented
mode of action [34]. However, although LNM was discovered from Streptomyces atroolivaceus S-140 in
1989, no additional natural producers or analogs were reported for nearly 30 years [35]. In 2002,
the LNM BGC was identified and shown to encode a hybrid NRPS-PKS consisting of two NRPS
modules and six PKS modules (Figure 3A) [36,37]. In 2017, to identify other LNM-like NPs, a
structure-targeted genome mining strategy was applied to the publicly available bacterial genomes (NCBI
database^i) and 5000 in-house unsequenced Actinobacteria strains (Figure 3B,C) [23]. As the LNM C-3
sulfur is known to be incorporated by a unique DUF-SH didomain (Figure 3A), it was proposed that
this genomic region could be used to identify additional lnm-type BGCs. With the DUF-SH sequence
as a query, 19 lnm-type BGCs could be detected in genomes from the NCBI database. The new
sequence information provided the conserved regions of the DUF-SH didomain necessary for the
design of degenerate primers, and the subsequent high-throughput rt-PCR screen of the in-house
strain collection resulted in the discovery of an additional 30 lnm-type BGCs, for a total of 49 BGCs
[23]. Thus, the utility of untargeted public sequencing data was demonstrated for the targeted
discovery of rare BGCs in an unsequenced strain collection while simultaneously highlighting the
incompleteness of the public database.

Figure 3. Discovery of the LNM Family of NPs.

(A) The lnm core biosynthetic genes are depicted with the color representing the encoded enzyme type: NRPS (blue), polyketide synthase (red), and the
DUF-SH didomain (green). The structure of LNM is also shown with its colors corresponding to the biosynthetic origin of the specific molecular region.
The variability of the core structure is organized by module based on the predicted products of the 49 lnm-type BGCs discovered. (B) Approximately
50 000 bacterial genomes were searched in silico for sequences containing a DUF-SH didomain, resulting in the identification of 19 lnm-type BGCs and
the design of degenerate primers for the PCR-guided discovery of lnm-type BGCs from unsequenced strains. (C) Real-time PCR of the DUF-SH
didomain in 5000 unsequenced Actinobacteria resulted in the further identification of 30 more lnm-type BGCs. (D) Two novel LNM-type NPs,
guangnanmycin A and weishanmycin A1, were isolated as representatives from two of the 18 LNM-type clades. The sulfurs highlighted in green are
predicted to be installed by DUF-SH didomains. Abbreviations: BGC, biosynthetic gene cluster; LNM, leinamycin; NP, natural product; NRPS,
nonribosomal peptide synthetase.
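Degenerate primers of the kind used in the PCR screen described above are mixtures of oligonucleotides written with IUPAC ambiguity codes. The following Python sketch shows how such a primer expands into its individual sequences and how its degeneracy (the number of distinct oligonucleotides in the mix) is computed. The primer `GGNTAYCAR` is an invented example, not one of the DUF-SH primers from the cited study, and the function names are hypothetical.

```python
from itertools import product
from math import prod

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def degeneracy(primer: str) -> int:
    """Number of distinct oligonucleotides represented by a degenerate primer."""
    return prod(len(IUPAC[base]) for base in primer.upper())


def expand(primer: str) -> list[str]:
    """All concrete sequences encoded by a degenerate primer."""
    choices = (IUPAC[base] for base in primer.upper())
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    primer = "GGNTAYCAR"  # hypothetical primer: N, Y and R are ambiguous positions
    print(degeneracy(primer))   # 4 * 2 * 2 = 16
    print(expand(primer)[:4])   # first few members of the mix
```

Higher degeneracy covers more sequence variants of a conserved region but dilutes each individual oligonucleotide in the mix, which is why primers are typically designed against the most conserved stretches identified from the sequenced homologs.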

In addition to the discovery of new lnm-type BGCs, sequencing of the in-house strains confirmed
the existence of as many as 18 distinct lnm-type clades, resulting in permutations of the predicted
LNM-type structures corresponding to products of six of the eight NRPS or PKS biosynthetic modules
(Figure 3A) [23]. Bioinformatics tools can predict an approximate core structure based on the
sequences of these modules [20,21]; however, it is still nearly impossible to predict the final structure
of NPs from their encoding genetic information. Notably, two new LNM-like NP families, the
guangnanmycins (GNMs) and the weishanmycins (WSMs), were isolated from strains Streptomyces sp.
CB01883 and Streptomyces sp. CB02120-2, respectively (Figure 3D) [23]. Both the GNMs and
WSMs differ substantially from LNM in structure, showcasing the structural diversity produced by
the varying lnm-type BGCs. This example highlights the power of genome mining as a general
strategy for surveying sequenced and unsequenced strains for targeted, structure-based discovery
of NPs.

Discovery of Phosphonate Family of NPs: High-Throughput Genome Mining and 
Sequencing from an Unsequenced Strain Library 

The phosphonate family of NPs has an impressive track record as pharmaceutically relevant
NPs due to their innate similarity to phosphate-containing nutrients [38]. Like the example of
LNM reviewed above, the biosynthetic machinery responsible for this pharmacophore can be
identified by the presence of a core gene, pepM, which in this case encodes phosphoenolpyruvate
phosphomutase. However, in stark contrast to the rare LNM family, many representatives of the
phosphonate family were known prior to an intensive genome mining project [39]. As such, degenerate
primers targeting pepM were readily established and used to screen multiple unsequenced strain
collections totaling ~10 000 Actinobacteria (Figure 4A) [27]. Draft genome sequencing of 403 strains
identified by this screen confirmed the presence of pepM in more than two-thirds of the candidates
(Figure 4B). After dereplication, a diverse collection of 192 strains encoding 78 distinct phosphonate
BGCs remained (>85% novel). Bioinformatic and statistical analyses indicated that the
collection represented ~62% of a predicted 125 possible phosphonate BGCs and that screening
of ~40 000 additional strains would be required to saturate Actinobacteria phosphonate BGCs (Figure 4D) [27,40].

In addition to the valuable genomic information, this genome mining effort also resulted in several
new phosphonates. Of the 45 putative phosphonate producers generating indicative 31P NMR signals
(23%), attention focused on three strains that displayed positive results in a phosphonate-specific
bioassay using an engineered Escherichia coli with inducible hypersensitivity to phosphonate antibiotics [41]
(Figure 4C). From these strains, a rare sulfur-containing phosphonate, argolaphos
B, containing a rare amino acid (N5-hydroxyarginine), was isolated and showed potential
broad-spectrum antibiotic activity (Figure 4E). Hence, here, similar to the LNM genome mining, the boundaries of
a NP family's diversity could be pushed, but instead of focusing on a single scaffold, a more relaxed
strategy provided a view of the overall diversity of NPs containing a specific moiety (phosphonate)
produced by a single class of bacteria (Actinobacteria).
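The percentages quoted for the phosphonate screen can be checked with a few lines of arithmetic. The sketch below uses only counts stated in the text; it assumes, without confirmation from the source, that the 23% figure is measured against the 192 dereplicated strains. The statistical extrapolation to 125 BGCs and ~40 000 further strains comes from the cited study and is not reproduced here.

```python
def percent(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`, in percent."""
    return 100.0 * part / whole


# Counts taken from the text.
SEQUENCED_HITS = 403          # strains draft-sequenced after the pepM PCR screen
DEREPLICATED_STRAINS = 192    # strains remaining after dereplication
DISTINCT_BGCS_FOUND = 78      # distinct phosphonate BGCs in those strains
PREDICTED_TOTAL_BGCS = 125    # predicted number of possible phosphonate BGCs
NMR_POSITIVE_STRAINS = 45     # strains with indicative 31P NMR signals

# Fraction of predicted phosphonate BGC diversity already sampled.
coverage = percent(DISTINCT_BGCS_FOUND, PREDICTED_TOTAL_BGCS)

# Assumption: the 23% is relative to the 192 dereplicated strains.
nmr_fraction = percent(NMR_POSITIVE_STRAINS, DEREPLICATED_STRAINS)

# "More than two-thirds" of 403 strains means at least this many confirmed.
min_confirmed = SEQUENCED_HITS * 2 // 3 + 1

if __name__ == "__main__":
    print(f"BGC coverage: {coverage:.1f}%")
    print(f"31P NMR-positive strains: {nmr_fraction:.1f}%")
    print(f"Minimum pepM-confirmed strains: {min_confirmed}")
```

Under that assumption both quoted percentages are reproduced to the nearest whole number (78/125 is about 62% and 45/192 is about 23%).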

Targeting Resistance Genes for NP Discovery 

As a general strategy that can be applied to either sequenced or unsequenced genomes,
resistance-conferring genes can be targeted for discovery of NPs with predictable targets or modes of action.
This strategy, while gaining popularity in recent years, is still limited by the need for knowledge of
rare or unusual resistance mechanisms, but it has yielded NPs with an impressive array of bioactivities
[42-46]. Targets can include duplicated housekeeping genes (e.g., fatty acid synthase or proteasome
components) [42-44], genes encoding target eukaryotic proteins (e.g., dihydroxyacid dehydratase, a
known herbicide target) [45], or other rare resistance genes (e.g., a gene associated with
topoisomerase inhibitors) [46]. By looking for resistance markers, expensive and time-consuming functional
assays can be partly replaced by genetic screening, without any bias towards or prior knowledge of a
final NP structure. To highlight this strategy, the resistance-based discovery of the thiotetronic acid
antibiotics is discussed below [42].

Figure 4. Genome Mining of Phosphonate NPs.

(A) A PCR screen targeting the pepM gene was used to identify phosphonate BGCs from 10 000 Actinobacteria. (B)
Hit strains from the PCR screen were sequenced and phosphonate BGCs identified therein. (C) Extracts from strains
containing phosphonate BGCs were assayed with an engineered Escherichia coli strain hypersensitive to
phosphonates. (D) Bioinformatics and statistics were used to map the diversity of phosphonate BGCs and to
extrapolate how much phosphonate diversity remains in Actinobacteria. (E) Examples of novel phosphonate
compounds isolated from genome mining are shown. Abbreviation: BGC, biosynthetic gene cluster.

Discovery of Thiotetronic Acid Antibiotics: Self-Resistance (Target)-Directed 
Genome Mining and BGC Prioritization 

Using a resistance-directed genome mining strategy, the genomes of 86 Salinispora strains were
bioinformatically surveyed for BGCs that contained putative resistance genes encoding the protein
target of the NP (Figure 5A) [42]. This analysis resulted in the identification of all orthologous groups
(OGs) within the Salinispora pangenome and core genome (Figure 5A, step i). The core genes were
classified by sequence similarity into clusters of orthologous groups (COGs), and any nonconserved
OG from the pangenome that fit into a COG from the core genome was considered duplicated
(Figure 5A, step ii). Among the duplicated OGs, nearly 40% were associated with secondary metabolite
BGCs (Figure 5A, step iii) [42].
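The first three steps of this workflow reduce to set operations over orthologous groups, which the following Python sketch illustrates on toy data. It is a simplified illustration, not the pipeline used in the cited study: the strain names, OG identifiers, COG labels, and function names are all hypothetical, and OG-to-COG assignment and BGC annotation are assumed to have been done beforehand.

```python
def core_and_pangenome(strain_ogs: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Core genome = OGs present in every strain; pangenome = OGs in any strain."""
    gene_sets = list(strain_ogs.values())
    core = set.intersection(*gene_sets)
    pan = set().union(*gene_sets)
    return core, pan


def duplicated_ogs(strain_ogs: dict[str, set[str]],
                   og_to_cog: dict[str, str]) -> set[str]:
    """Nonconserved OGs that fall into the same COG as some core OG (step ii)."""
    core, pan = core_and_pangenome(strain_ogs)
    core_cogs = {og_to_cog[og] for og in core if og in og_to_cog}
    return {og for og in pan - core if og_to_cog.get(og) in core_cogs}


def bgc_associated(duplicated: set[str], ogs_in_bgcs: set[str]) -> set[str]:
    """Duplicated OGs located inside predicted BGCs (step iii)."""
    return duplicated & ogs_in_bgcs


# Hypothetical toy data: three strains, one extra fabF-like copy in strain_1.
STRAIN_OGS = {
    "strain_1": {"og_fabF", "og_rpoB", "og_fabF_copy"},
    "strain_2": {"og_fabF", "og_rpoB"},
    "strain_3": {"og_fabF", "og_rpoB", "og_misc"},
}
OG_TO_COG = {
    "og_fabF": "COG_fatty_acid",
    "og_fabF_copy": "COG_fatty_acid",
    "og_rpoB": "COG_transcription",
    "og_misc": "COG_other",
}
OGS_IN_BGCS = {"og_fabF_copy", "og_misc"}

if __name__ == "__main__":
    dup = duplicated_ogs(STRAIN_OGS, OG_TO_COG)
    print(dup)
    print(bgc_associated(dup, OGS_IN_BGCS))
```

In the toy data only the extra fabF-like copy survives both filters: it is absent from the core genome, shares a COG with a core housekeeping gene, and sits inside a predicted BGC, which is the signature of a candidate self-resistance gene.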

To this point, the findings could be generalized to any target identified among the duplicated OGs,
but the study focused on OGs related to lipid transport and metabolism, with the goal of
finding inhibitors of bacterial fatty acid biosynthesis (Figure 5A, step iv). Salin8269, one of 12 proteins
from Salinispora with homology to the fatty acid elongation enzymes FabB/F, also shows high
sequence similarity to the self-resistance proteins PtmP3 and PtnP3 from the producers of the fatty acid
synthase inhibitors platensimycin and platencin, respectively, and is encoded within the hybrid
peptide-polyketide thiolactomycin (tlm) BGC from Salinispora pacifica CNS-863 (Figure 5A, step v)
[43]. The tlm BGC was cloned through a modified transformation-associated recombination
(TAR)-based platform and heterologously expressed in Streptomyces coelicolor M1152. Subsequently, a
group of NPs featuring a rare thiotetronic acid moiety, including the previously reported fatty acid
synthase inhibitor thiolactomycin (TLM) as well as three analogs (one novel), were isolated (Figure 5B)
[47,48]. A second BGC, the thiotetroamide (ttm) BGC from Streptomyces afghaniensis, was similarly
cloned on the basis of its resemblance to the tlm BGC and the presence of two putative
self-resistance genes, ttmE and ttmj, with both also showing high similarity to ptmP3 and ptnP3 [43].
Heterologous expression of the ttm BGC afforded four more TLM analogs, including thiotetroamide
(TTM) C, among others (Figure 5B).

Figure 5. Discovery of thiotetronic acid antibiotics.

(A) The tlm BGC was prioritized by searching through the Salinispora pan-genome for duplicated housekeeping genes for BGCs as depicted. First, the core
OGs present in all Salinispora were identified and grouped by similarity (i). Any duplicated OGs from the core genome were then examined (ii), and those
found within predicted BGCs were analyzed (iii). Further categorization and prioritization yielded a single homologue, located within the tlm BGC (iv and v).
(B) The structures of several known and novel thiotetronic acid natural products isolated using this method are shown. Abbreviations: BGC, biosynthetic
gene cluster; COGs, clusters of orthologous groups; OGs, orthologous groups.

This study showcases a genome mining strategy that targets BGCs with duplicated housekeeping 
genes that may encode protein targets of the NPs. With this strategy, it is now possible to infer 
the target of an uncharacterized NP by analyzing the BGC-associated self-resistance genes without 
a priori knowledge of the NP structure. 

Activation of Silent BGCs 

Once one or more BGCs have been prioritized, low titers or silent BGCs can still hinder or prevent 
further testing without additional tools in the NP discovery pipeline. The abundance of silent orphan 
BGCs in bacterial genomes has inspired the development of various methods for activation [15], 
which can be generally categorized into BGC-targeted and untargeted approaches. The section 
below reviews these approaches in brief. 

Untargeted approaches aim to alter the metabolome of strains through indiscriminate techniques such as
media optimization [49,50], addition of elicitors [51], ribosome engineering [49,52], metabolic engineering
[53-55], and manipulation of global regulatory [56] and protein modification [57] genes (Box 1). Although
successful for NP discovery, traditional untargeted approaches cannot be used to activate specific
BGCs of interest and suffer from many pitfalls. For example, the overexpression of phosphopantetheinyl
transferases, which are responsible for post-translational covalent modification of carrier proteins in fatty
acid, polyketide, and nonribosomal peptide biosynthesis, resulted in 23 of 33 Actinobacteria strains
producing new NPs [57]. However, without further investigation, the new NPs could not be identified, as any of
several BGCs encoding a carrier protein could have been activated. In contrast, a targeted approach that activates
a single BGC provides the useful genotype-phenotype link that is often lacking with these untargeted
approaches. Similarly, untargeted approaches such as those described in Box 1 affect global regulation or
metabolic flux, which in turn can improve yields from a wide range of BGCs without requiring detailed
knowledge about a specific BGC (or even a genome in some cases) [51]. Despite their untargeted nature,
many of these strategies can be applicable when targeting a BGC that has demonstrated low but detectable
production, as the same transcriptional or translational bottlenecks that may lead to BGC silence may
also result in low production [58]. In this case, however, the genotype-phenotype link would have already
been established, potentially via more targeted approaches.

Alternatively, targeted approaches are designed to activate specific BGCs by a variety of sequence-dependent
techniques including heterologous expression, promoter exchange, BGC refactoring, and
BGC-specific regulator manipulation (Box 1) [56,59-63]. While typically lower throughput, targeted
approaches can often take advantage of both the vast genomic information and cutting-edge genome editing
technologies such as CRISPR [64,65] and recombineering [66,67]. Two especially popular
targeted activation strategies involve isolation of a single BGC from its native environment: BGC
refactoring [59,68,69] and heterologous expression [42,56,70-72] (Box 1). As both strategies are predicated
primarily on breaking transcriptional regulatory networks, they are often used in conjunction.
Heterologous expression is possibly the most popular targeted activation strategy, likely because of two other
key advantages: (i) the BGC of interest is introduced into a characterized environment, making
identification of any new NPs simpler; and (ii) the new host has often been domesticated [11] and is generally more
genetically amenable [58]. However, BGC-specific regulators and low titers due to poor transcriptional or
translational throughput are both major hurdles, even in a heterologous host.

BGC refactoring is synthetic biology's most thorough (and time-intensive) answer to BGC activation. In this 
process, known genetic elements (promoters, ribosome binding sites, and terminators) are coupled with 
each gene from the targeted BGC, yielding a new BGC that contains as little or as much regulation as 
desired with controlled transcription and translation rates [73,74]. While refactoring is potentially powerful, 


Box 1. Strategies for BGC Activation 

BGC activation is considered untargeted when the strategy is not specific to a single BGC of interest but rather 
affects some or all BGCs encoded within an organism. In addition to the more traditional media optimization or 
addition of elicitors, four untargeted approaches are detailed below (Figure I, left). 

(A) Ribosome Engineering. Some BGCs are silent at the transcriptional level while others are silent at the
translational level. For those BGCs with a translational bottleneck, the bacteria can be subjected to sublethal levels
of ribosome-targeting antibiotics, resulting in the evolution of the ribosome and improved translation.

(B) Metabolic Engineering. To achieve higher titers of NPs, competing and supporting pathways (often 
involving substrates or cofactors) can be disrupted or augmented, respectively. 

(C) Manipulation of Global Regulatory Genes. Many BGCs are regulated by proteins encoded outside the BGC 
boundaries, and deletion or overexpression of these negative or positive regulators, respectively, may activate 
one or more BGCs in a bacterial host. 

(D) Manipulation of Protein-Modification Genes. The addition of certain genes, such as a phosphopantetheinyl
transferase, may result in the post-translational activation of key proteins encoded by silent BGCs.

When targeting a specific BGC, several options are available for activation in addition to the untargeted strategies. Like
the untargeted strategies, these targeted strategies can affect multiple levels of silence, and they are
described below (Figure I, right).

(E) Heterologous Expression. The targeted BGC can be expressed in a heterologous host to disrupt native
regulatory networks and to enable improved genetic amenability.

(F) Promoter Exchange. Promoters from the targeted BGC can be replaced with non-native promoters to
remove transcriptional regulation.

(G) BGC Refactoring. The targeted BGC can be reorganized with predictable transcriptional (promoters,
terminators) and translational (ribosome-binding sites) elements.

(H) Manipulation of BGC-Specific Regulatory Genes. Deletion or overexpression of BGC-specific negative or
positive regulators, respectively, can be used to overcome transcriptional silence.

Figure I. Strategies for BGC Activation.

Untargeted strategies are illustrated in the left panel (A-D), and targeted strategies are illustrated in the right
panel (E-H). Genes are color coded as follows: core biosynthetic genes (blue), negative regulatory genes (red),
protein modification genes (green), and positive regulatory genes (orange).


genetic elements are not universal in bacteria, so genetic toolboxes must be developed for each genetic
system [75]. Additionally, refactoring of a single BGC requires both the assembly and balancing of
potentially dozens of genetic parts, and this hypervariability results in extensive troubleshooting during and after
assembly. Alternatively, intermediate steps such as BGC-specific regulator manipulation [60] or promoter
exchanges [44,76] can also be utilized in order to overcome regulatory or transcriptional challenges, usually
with far less effort than full BGC refactoring.
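
To make the bookkeeping behind refactoring concrete, the following Python sketch pairs each gene of a BGC with a promoter, a ribosome binding site, and a terminator drawn from a parts library. It is only an illustration of the idea described above: the `Cassette` type, the `refactor` function, and all part and gene names are hypothetical, and reusing parts in rotation is a simplification made here, not a design recommendation.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Cassette:
    """One refactored expression unit: promoter-RBS-gene-terminator."""
    promoter: str
    rbs: str
    gene: str
    terminator: str


def refactor(genes, promoters, rbss, terminators):
    """Couple every gene with one promoter, RBS, and terminator.

    Parts are reused in rotation when a library is smaller than the BGC.
    """
    if not (promoters and rbss and terminators):
        raise ValueError("each parts library needs at least one entry")
    parts = zip(genes, cycle(promoters), cycle(rbss), cycle(terminators))
    return [Cassette(p, r, g, t) for g, p, r, t in parts]


# Hypothetical three-gene BGC and a small parts library.
design = refactor(
    genes=["geneA", "geneB", "geneC"],
    promoters=["P1", "P2"],
    rbss=["RBS1"],
    terminators=["T1", "T2", "T3"],
)
for cassette in design:
    print(cassette)
```

Even this toy design needs nine part assignments for three genes, which gives a sense of how quickly the number of parts to assemble and balance grows for a BGC with dozens of genes.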

Ultimately, both targeted and untargeted activation approaches have their own merits, and depending
on the chosen overall strategy for discovery, either may be preferred. However, as genomic data
continue to increase and the targeted activation tools become more developed and mainstream, it is 
expected that researchers will continue to shift towards targeted approaches to better access their 
prioritized BGCs. 


Concluding Remarks and Future Perspectives 

NP discovery, especially over the past several decades, has struggled to reach the heights of its
golden age in the 1950s and 1960s, but the field appears poised for a renaissance driven by advancements
in genomics, bioinformatics, and synthetic biology [2]. Many of the reasons commonly listed for
the decline in NP discovery, such as rediscovery, can now be alleviated by genomic information.
Dropping sequencing costs, the availability of powerful bioinformatics tools, and the realization of
near limitless BGCs have provided the impetus needed for the field to re-focus on novel NP discovery.
However, several pertinent questions remain and are addressed below (see Outstanding Questions).

Genome sequencing has significantly advanced the field, but the rate of discovery of NPs cannot keep
pace with the sequencing of their encoding BGCs. Eventually, the rate of BGC discovery will begin to
plateau, though that point has not yet been determined. With BGC sequencing no longer the rate-limiting
step, prioritization of BGCs for NP discovery has become one of the key strategic questions
facing the field moving forward. How can novelty as a BGC/NP trait be screened and sorted for? The
solution will not come from a single source, but rather, it is a question of developing multidisciplinary
collaborations between chemists, synthetic biologists, microbiologists, and computational biologists,
among others, to address the wide range of smaller challenges. Of special interest is the
construction of a public BGC database to assess total NP diversity and to survey microbial biodiversity
[77]. Likewise, the development of platform technologies that can be rapidly embraced by the entire
field and utilized at both scale and speed (such as synthetic biology tools in non-model hosts or
standardized computational pipelines) is critically important for translating the genetic information all the
way through to the point of discovery. Fortunately, despite all these questions, the field of NP discovery
is making great progress in the genomic era and shows strong signs of continuing to develop.
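
One simple way to reason about when BGC discovery begins to plateau is an accumulation (collector's) curve: the cumulative number of distinct BGC families observed as genomes are added. The Python sketch below is only an illustration under stated assumptions; the function names and the toy family labels are hypothetical, and it presumes that BGCs have already been grouped into families by some upstream tool.

```python
def accumulation_curve(genome_families):
    """Cumulative count of distinct BGC families after each genome.

    genome_families: list of sets of family labels, one set per genome,
    in the order the genomes were sequenced (hypothetical input).
    """
    seen = set()
    curve = []
    for families in genome_families:
        seen |= families
        curve.append(len(seen))
    return curve


def new_families_per_genome(curve):
    """Number of previously unseen families contributed by each genome."""
    return [after - before for before, after in zip([0] + curve, curve)]


# Toy data: four genomes and three BGC families in total.
genome_families = [{"F1", "F2"}, {"F2", "F3"}, {"F1", "F3"}, {"F3"}]

curve = accumulation_curve(genome_families)
gains = new_families_per_genome(curve)
print(curve)
print(gains)
```

A run of genomes that contribute no new families indicates saturation for the sampled group. The result depends on sequencing order and on how families are defined, so in practice such curves are usually averaged over many random orderings.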
Outstanding Questions

How many genomes must be sequenced before BGC discovery plateaus?

How should strains be prioritized for genome sequencing to most efficiently discover novel NPs?

Is it possible to develop tools for targeting structural and functional diversity from genomic information?

What strategy is most effective for the prioritization of BGCs for targeted expression?

Is it possible to develop a universal strategy to translate targeted BGCs into NPs at both scale and speed?

Does the quality of genome sequences matter; that is, draft versus fully assembled genomes, especially with regard to natural product discovery?

How can we encourage better collaboration between chemists, synthetic biologists, microbiologists, and computational biologists towards developing the necessary technologies for a natural product discovery pipeline?


Resources 

i. ftp://ftp.ncbi.nlm.nih.gov/genomes/
ii. https://efi.igb.illinois.edu/efi-est/
iii. https://omictools.com/big-scape-tool
iv. https://antismash.secondarymetabolites.org/
v. https://omictools.com/prism-7-tool
vi. https://mibig.secondarymetabolites.org/
vii. https://blast.ncbi.nlm.nih.gov/Blast.cgi